Audio-visual speech fragment decoding
Authors
Abstract
This paper presents a robust speech recognition technique called audio-visual speech fragment decoding (AV-SFD), in which the visual signal is exploited both as a cue for source separation and as a carrier of phonetic information. The model builds on the existing audio-only SFD technique which, based on the auditory scene analysis account of perceptual organisation, works by combining a bottom-up layer that identifies sound fragments with a model-driven layer that searches for fragment groupings that can be interpreted as recognisable speech utterances. In AV-SFD, the visual signal is used in the model-driven stage, improving the ability of the decoder to distinguish between foreground and background fragments. The system has been evaluated using an audio-visual version of the PASCAL Speech Separation Challenge. At low SNRs, recognition error rates are reduced by around 20% relative to the performance of a conventional multistream AV-ASR system.
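The model-driven grouping search described above can be illustrated with a toy sketch: each fragment is scored under a foreground (speech) model and a background model, and the decoder picks the foreground/background labelling that maximises the total likelihood. The fragment structure and scores below are invented for illustration, not taken from the paper's implementation.

```python
# Minimal sketch of the fragment-grouping search at the heart of SFD.
# Fragment log-likelihoods here are invented toy values.
from itertools import product

def best_grouping(fragments):
    """Each fragment carries a log-likelihood under the speech (foreground)
    model and under the background model; the decoder selects the
    foreground/background labelling that maximises total log-likelihood."""
    best_score, best_labels = float("-inf"), None
    for labels in product((0, 1), repeat=len(fragments)):  # 1 = foreground
        score = sum(f["speech"] if lab else f["background"]
                    for f, lab in zip(fragments, labels))
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score

# Toy fragments: log-likelihoods under each interpretation.
frags = [{"speech": -1.0, "background": -3.0},
         {"speech": -4.0, "background": -0.5},
         {"speech": -2.0, "background": -2.5}]
labels, score = best_grouping(frags)
# labels -> (1, 0, 1): fragments 1 and 3 are assigned to the foreground
```

A real decoder would not enumerate all labellings exhaustively; the point of the sketch is only the joint scoring of a fragment grouping against speech and background models, which is where AV-SFD injects the visual evidence.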
Similar references
Using twin-HMM-based audio-visual speech enhancement as a front-end for robust audio-visual speech recognition
In this paper we propose the use of the recently introduced twin-HMM-based audio-visual speech enhancement algorithm as a front-end for audio-visual speech recognition systems. This algorithm determines the clean speech statistics in the recognition domain based on the audio-visual observations and transforms these statistics to the synthesis domain through the so-called twin HMMs. The adopted fr...
Efficient likelihood computation in multi-stream HMM based audio-visual speech recognition
Multi-stream hidden Markov models have recently been introduced in the field of automatic speech recognition as an alternative to single-stream modeling of sequences of speech informative features. In particular, they have been very successful in audio-visual speech recognition, where features extracted from video of the speaker’s lips are also available. However, in contrast to single-stream m...
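In a multi-stream HMM of the kind described above, the state emission log-likelihood is typically a stream-weighted sum of per-stream log-likelihoods. The sketch below shows that combination for one audio and one video stream with single-Gaussian emissions; the stream weight and the state parameters are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of multi-stream emission scoring:
#   log b_j(o) = lam * log b_audio(o_A) + (1 - lam) * log b_video(o_V)
# The stream weight lam and the Gaussian parameters are invented examples.
import math

def gaussian_loglik(x, mean, var):
    # Log-density of a univariate Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_loglik(audio_obs, video_obs, state, lam=0.7):
    """Stream-weighted combination of audio and video log-likelihoods."""
    la = gaussian_loglik(audio_obs, *state["audio"])   # (mean, var)
    lv = gaussian_loglik(video_obs, *state["video"])
    return lam * la + (1.0 - lam) * lv

state = {"audio": (0.0, 1.0), "video": (0.5, 2.0)}
ll = multistream_loglik(0.2, 0.4, state)
```

The weight `lam` lets the recogniser trust the audio stream less as acoustic noise increases, which is the usual motivation for the multi-stream formulation.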
Audio-visual speech recognition system for a robot
Automatic Speech Recognition (ASR) for a robot should be robust to noise because a robot works in noisy environments. Audio-Visual (AV) integration is one of the key ideas to improve its robustness in such environments. This paper proposes AV integration for an ASR system for a robot which applies AV integration to Voice Activity Detection (VAD) and speech decoding. In VAD, we apply AV-integr...
Speech-to-lip movement synthesis based on the EM algorithm using audio-visual HMMs
This paper proposes a method to re-estimate output visual parameters for speech-to-lip movement synthesis using audio-visual Hidden Markov Models (HMMs) under the Expectation-Maximization (EM) algorithm. In the conventional methods for speech-to-lip movement synthesis, there is a synthesis method estimating a visual parameter sequence through the Viterbi alignment of an input acoustic speech sign...
NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition
Although audio-visual speech is well known to improve the robustness properties of automatic speech recognition (ASR) systems against noise, the realm of audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT ...
Journal:
Volume/Issue:
Pages: -
Publication year: 2007